{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The bike rides dataset\n",
    "\n",
    "In this notebook, we will present the \"Bike Ride\" dataset. This dataset is\n",
    "located in the directory `datasets` in a comma separated values (CSV) format.\n",
    "\n",
    "We open this dataset using pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "cycling = pd.read_csv(\"../datasets/bike_rides.csv\")\n",
    "cycling.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first column `timestamp` contains a specific information regarding the the\n",
    "time and date of a record while other columns contain numerical value of some\n",
    "specific measurements. Let's check the data type of the columns more in\n",
    "details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cycling.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, CSV format store data as text. Pandas tries to infer numerical type by\n",
    "default. It is the reason why all features but `timestamp` are encoded as\n",
    "floating point values. However, we see that the `timestamp` is stored as an\n",
    "`object` column. It means that the data in this column are stored as `str`\n",
    "rather than a specialized `datetime` data type.\n",
    "\n",
    "In fact, one needs to set an option such that pandas is directed to infer such\n",
    "data type when opening the file. In addition, we will want to use `timestamp`\n",
    "as an index. Thus, we can reopen the file with some extra arguments to help\n",
    "pandas at reading properly our CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cycling = pd.read_csv(\n",
    "    \"../datasets/bike_rides.csv\", index_col=0, parse_dates=True\n",
    ")\n",
    "cycling.index.name = \"\"\n",
    "cycling.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cycling.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By specifying to pandas to parse the date, we obtain a `DatetimeIndex` that is\n",
    "really handy when filtering data based on date.\n",
    "\n",
    "We can now have a look at the data stored in our dataframe. It will help us to\n",
    "frame the data science problem that we try to solve.\n",
    "\n",
    "The records correspond at information derived from GPS recordings of a cyclist\n",
    "(`speed`, `acceleration`, `slope`) and some extra information acquired from\n",
    "other sensors: `heart-rate` that corresponds to the number of beats per minute\n",
    "of the cyclist heart, `cadence` that is the rate at which a cyclist is turning\n",
    "the pedals, and `power` that corresponds to the work required by the cyclist\n",
    "to go forward.\n",
    "\n",
    "The power might be slightly an abstract quantity so let's give a more\n",
    "intuitive explanation.\n",
    "\n",
    "Let's take the example of a soup blender that one uses to blend vegetable. The\n",
    "engine of this blender develop an instantaneous power of ~300 Watts to blend\n",
    "the vegetable. Here, our cyclist is just the engine of the blender (at the\n",
    "difference that an average cyclist will develop an instantaneous power around\n",
    "~150 Watts) and blending the vegetable corresponds to move the cyclist's bike\n",
    "forward.\n",
    "\n",
    "Professional cyclists are using power to calibrate their training and track\n",
    "the energy spent during a ride. For instance, riding at a higher power\n",
    "requires more energy and thus, you need to provide resources to create this\n",
    "energy. With human, this resource is food. For our soup blender, this resource\n",
    "can be uranium, petrol, natural gas, coal, etc. Our body serves as a power\n",
    "plant to transform the resources into energy.\n",
    "\n",
    "The issue with measuring power is linked to the cost of the sensor: a cycling\n",
    "power meter. The cost of such sensor vary from $400 to $1000. Thus, our data\n",
    "science problem is quite easy: can we predict instantaneous cyclist power from\n",
    "other (cheaper) sensors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"power\"\n",
    "data, target = cycling.drop(columns=target_name), cycling[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can have a first look at the target distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "target.plot.hist(bins=50, edgecolor=\"black\")\n",
    "plt.xlabel(\"Power (W)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see a pick at 0 Watts, it corresponds to whenever our cyclist does not\n",
    "pedals (descent, stopped). In average, this cyclist delivers a power around\n",
    "~200 Watts. We also see a long tail from ~300 Watts to ~400 Watts. You can\n",
    "think that this range of data correspond to effort a cyclist will train to\n",
    "reproduce to be able to breakout in the final kilometers of a cycling race.\n",
    "However, this is costly for the human body and no one can cruise with this\n",
    "power output.\n",
    "\n",
    "Now, let's have a look at the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can first have a closer look to the index of the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that records are acquired every seconds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.index.min(), data.index.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The starting date is the August 18, 2020 and the ending date is September 13,\n",
    "2020. However, it is obvious that our cyclist did not ride every seconds\n",
    "between these dates. Indeed, only a couple of date should be present in the\n",
    "dataframe, corresponding to the number of cycling rides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.index.normalize().nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, we have only four different dates corresponding to four rides. Let's\n",
    "extract only the first ride of August 18, 2020."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "date_first_ride = \"2020-08-18\"\n",
    "cycling_ride = cycling.loc[date_first_ride]\n",
    "data_ride, target_ride = data.loc[date_first_ride], target.loc[date_first_ride]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_ride.plot()\n",
    "plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n",
    "_ = plt.title(\"Sensor values for different cyclist measurements\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the unit and range of each measurement (feature) is different, it is\n",
    "rather difficult to interpret the plot. Also, the high temporal resolution\n",
    "make it difficult to make any observation. We could resample the data to get a\n",
    "smoother visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_ride.resample(\"60s\").mean().plot()\n",
    "plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n",
    "_ = plt.title(\"Sensor values for different cyclist measurements\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check the range of the different features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "axs = data_ride.hist(figsize=(10, 12), bins=50, edgecolor=\"black\", grid=False)\n",
    "# add the units to the plots\n",
    "units = [\n",
    "    \"beats per minute\",\n",
    "    \"rotations per minute\",\n",
    "    \"meters per second\",\n",
    "    \"meters per second squared\",\n",
    "    \"%\",\n",
    "]\n",
    "for unit, ax in zip(units, axs.ravel()):\n",
    "    ax.set_xlabel(unit)\n",
    "plt.subplots_adjust(hspace=0.6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From these plots, we can see some interesting information: a cyclist is\n",
    "spending some time without pedaling. This samples should be associated with a\n",
    "null power. We also see that the slope have large extremum.\n",
    "\n",
    "Let's make a pair plot on a subset of data samples to see if we can confirm\n",
    "some of these intuitions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "indices = rng.choice(np.arange(cycling_ride.shape[0]), size=500, replace=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "subset = cycling_ride.iloc[indices].copy()\n",
    "# Quantize the target and keep the midpoint for each interval\n",
    "subset[\"power\"] = pd.qcut(subset[\"power\"], 6, retbins=False)\n",
    "subset[\"power\"] = subset[\"power\"].apply(lambda x: x.mid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "_ = sns.pairplot(data=subset, hue=\"power\", palette=\"viridis\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, we see that low cadence is associated with low power. We can also the\n",
    "a link between higher slope / high heart-rate and higher power: a cyclist need\n",
    "to develop more energy to go uphill enforcing a stronger physiological stimuli\n",
    "on the body. We can confirm this intuition by looking at the interaction\n",
    "between the slope and the speed: a lower speed with a higher slope is usually\n",
    "associated with higher power."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}